Neural network accelerator tile architecture with three-dimensional stacking

ABSTRACT

A three dimensional neural network accelerator that includes a first neural network accelerator tile that includes a first transmission coil, and a second neural network accelerator tile that includes a second transmission coil, wherein the first neural network accelerator tile is adjacent to and aligned vertically with the second neural network accelerator tile, and wherein the first transmission coil is configured to wirelessly communicate with the second transmission coil via inductive coupling.

CROSS-REFERENCE TO RELATED APPLICATION

This application is a continuation of U.S. patent application Ser. No. 15/625,810, entitled “NEURAL NETWORK ACCELERATOR TILE ARCHITECTURE WITH THREE-DIMENSIONAL STACKING,” filed Jun. 16, 2017, which is incorporated herein by reference in its entirety.

FIELD

This specification generally relates to accelerating neural network computations in hardware.

BACKGROUND

Neural networks are machine learning models that employ one or more layers of nonlinear units to predict an output for a received input. Some neural networks include one or more hidden layers in addition to an output layer. The output of each hidden layer is used as input to the next layer in the network, i.e., the next hidden layer or the output layer. Each layer of the network generates an output from a received input in accordance with current values of a respective set of parameters.

SUMMARY

In general, one innovative aspect of the subject matter described in this specification can be embodied in a three dimensional neural network accelerator that includes a first neural network accelerator tile having a first transmission coil and a second neural network accelerator tile having a second transmission coil, where the first neural network accelerator tile is adjacent to and aligned vertically with the second neural network accelerator tile, the first transmission coil is configured to establish wireless communication with the second transmission coil via inductive coupling, and the first neural network accelerator tile and the second neural network accelerator tile are configured to accelerate a computation of a neural network by forming, through the established wireless communication, a static interconnect system that includes a communication scheme providing for an uninterruptible flow of data.

These and other implementations can each optionally include one or more of the following features: the first neural network accelerator tile is included in a first array of tiles on a first neural network accelerator chip; the second neural network accelerator tile is included in a second array of tiles on a second neural network accelerator chip; the first transmission coil is further configured to provide a digital logic interconnection between the first neural network accelerator tile and the second neural network accelerator tile through Near Field Wireless Communication; the first transmission coil further comprises a ThruChip Interface (TCI) receiver and a TCI transmitter; the TCI receiver is configured to receive wireless communication from the second transmission coil; the TCI transmitter is configured to transmit wireless communication to the second transmission coil; the first neural network accelerator tile further comprises a processing element and a ring-bus; the processing element, the first transmission coil, the TCI receiver, and the TCI transmitter are communicably connected through the ring-bus; the processing element includes circuitry to perform neural network computations in hardware; the first transmission coil is further configured to establish a TCI connection with the second transmission coil to form a vertical ring-bus; the first neural network accelerator tile further comprises a shorting plane to prevent interference from other transmission coils; the first neural network accelerator tile is rotated 180 degrees with respect to the second neural network accelerator tile; and the first neural network accelerator tile and second neural network accelerator tile are oriented the same.

Particular embodiments of the subject matter described in this specification can be implemented so as to realize one or more of the following advantages. A three-dimensionally stacked neural network accelerator has increased on-chip memory capacity to, for example, hold larger models. Additional advantages over other three dimensional stacking solutions include lower cost, higher bandwidth, a more compact design, and increased scalability.

The details of one or more implementations of the subject matter described in this specification are set forth in the accompanying drawings and the description below. Other potential features, aspects, and advantages of the subject matter will become apparent from the description, the drawings, and the claims.

BRIEF DESCRIPTION OF THE DRAWINGS

FIGS. 1A-C are block diagrams of an example neural network accelerator tile.

FIG. 2 illustrates an example three-dimensionally stacked neural network accelerator with two neural network accelerator chips.

FIG. 3 illustrates another example three-dimensionally stacked neural network accelerator with two neural network accelerator chips.

FIG. 4 illustrates yet another example three-dimensionally stacked neural network accelerator with two neural network accelerator chips.

FIG. 5 illustrates an example three-dimensionally stacked neural network accelerator with a vertical ring-bus implementation for a medium bandwidth design.

Like reference numbers and designations in the various drawings indicate like elements.

DETAILED DESCRIPTION

Technology is rapidly progressing in the areas of robotics, the internet of things, and other areas that use machine learning algorithms. For example, facial recognition and user preference determination technologies use machine learning techniques, such as neural networks, to increase result accuracy. Neural network computations may be performed using general purpose graphics processing units, field-programmable gate arrays, application-specific chips, and other similar hardware. As neural network models increase in size and complexity, they require more computational resources for execution. To handle the increase in computational resources, large-scale hardware neural network accelerators may be employed.

Described herein are architectures for a neural network accelerator. A neural network accelerator is a hardware computing system that is configured to accelerate the computation of a neural network, i.e., the processing of an input using the neural network to generate an output. Neural network accelerators may be fabricated by stacking neural network dies (chips) that each include an array of interconnected neural network accelerator tiles. In some implementations, the neural network tiles within an array on a neural network chip are communicably coupled to one another via a planar ring-bus embedding. Once cut, the neural network chips may be three dimensionally stacked to form a neural network accelerator. When stacked, at least one neural network tile within the array of tiles on one neural network chip may be communicably linked wirelessly to a respective neural network tile on another neural network chip that is stacked just above or below the first chip. The linked neural network tiles form a static interconnect system. In some implementations, the formed static interconnect system is organized as a linear sequence of processing through the respective neural network accelerator tiles. The linear pipeline of processing tiles in this sequence starts and ends in a special controller referred to as an un-core. The un-core is a collection of functional blocks, which may deal with input/output (I/O) to a host computer, interface to off-chip memory, connect to I/O devices, and/or perform synchronization, coordination, and buffer functions.

A neural network accelerator may be fabricated through wafer-level stacking, where wafers are stacked on top of one another and bonded together. A wafer is a thin slice of semiconductor material (e.g., silicon, gallium nitride, etc.) that is typically round and may be between 300 and 450 millimeters in diameter. Each wafer has a series of dies (or chips) that each include an array of neural network accelerator tiles. The dies (and their tiles) are aligned as the wafers are stacked and bonded. When stacked, the neural network accelerator tiles on different chips may be communicatively coupled to each other through wireless communication (i.e., inductive coupling using TCI technology) or through vertical interconnects, such as through-silicon vias (TSVs). The stacked wafers are then cut into die-stacks, which are the neural network accelerators.

Each neural network accelerator tile is self-contained and can independently execute computations required by a portion of a multi-layer neural network. A neural network accelerator tile includes a processing element (or processor, processor core), a memory, and a ring-bus coupled to transmission coils. The transmission coils may be configured to communicate inductively with the transmission coils of an adjacent tile that is stacked directly above or below. The processing element is configured to perform computations required to process neural network computations in hardware. For example, the processing element may perform one or more neural network building block computations in hardware, e.g., matrix multiplies, computations of activation functions, pooling, softmax or logistic regression computations, and so on. Example architectures for a processing element included in a neural network accelerator tile are described in U.S. patent application Ser. No. 15/335,769, which is incorporated herein by reference.
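
For illustration only, the following minimal sketch models in software the kinds of building-block computations listed above (a matrix multiply, an activation function, and pooling). The `tile_forward` function and its parameters are hypothetical stand-ins; in the accelerator these operations are performed by the tile's processing element in hardware.

```python
# A software model of a tile's building-block computations (illustrative
# only; the processing element performs these operations in hardware).
import numpy as np

def tile_forward(x, weights, pool_width=2):
    """One tile's share of a layer: matrix multiply, activation, pooling."""
    z = x @ weights                 # matrix multiply
    a = np.maximum(z, 0.0)          # activation function (ReLU shown here)
    # Simple 1-D max pooling over the feature dimension.
    trimmed = a[:, : a.shape[1] // pool_width * pool_width]
    return trimmed.reshape(a.shape[0], -1, pool_width).max(axis=2)

x = np.random.rand(4, 8)            # batch of 4 inputs with 8 features
w = np.random.rand(8, 6)            # this tile's slice of the layer weights
print(tile_forward(x, w).shape)     # (4, 3)
```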

During the fabrication of a neural network accelerator, the neural network accelerator chips/dies are stacked in such a manner as to allow for wireless communication between the chips through the embedded neural network tiles. The neural network accelerator tile supports this three-dimensional scaling by enabling wireless communication between stacked tiles through the embedded transmission coils. In some implementations, the wireless communication between stacked tiles is based on the ThruChip wireless interconnect technology (described in more detail in “Low-Cost 3D Chip Stacking with ThruChip Wireless Connections”, by Dave Ditzel, Aug. 11, 2014). For example, the transmission coils may be a pair of loops that provide a TCI. In some implementations, the transmission coils are constructed with a complementary metal-oxide-semiconductor (CMOS) process above logic and/or memory areas of the neural network accelerator tile. When a neural network accelerator tile is stacked onto other tiles (i.e., the respective dies/chips are stacked), the TCIs (coils) allow data to be sent to and/or received from neural network accelerator tiles above or below the respective neural network accelerator tile. In some implementations, as shown in FIGS. 2 and 3, at least one tile site in the array is configured to receive wireless transmission from a respective tile site on a neural network chip that is stacked directly above or directly below, and another tile site is configured to send wireless transmission to a respective tile site on the same neural network chip. In some implementations, as shown in FIG. 4, one tile site in the array is configured to both receive and send wireless transmission from/to a respective tile site on a neural network chip that is stacked directly above or directly below.

A neural network accelerator chip also includes other on-chip circuitry within the un-core, such as I/O interface circuitry to couple data in and out of the array of tiles, clock distribution circuitry to provide clock signals to the processing elements of the tiles, and other interface and control functions. For example, an interface may be to a host computer. Such an interface may be replicated on all chips in a three dimensional stack, or the interface may be delegated to a second chip that employs a different processing node and that is coupled to the three-dimensional stack via the TCIs.

A neural network accelerator chip may route data between each tile according to a sequence formed through a static interconnect system. For example, data may be received at one computing tile in the static interconnect system, processed, and the output of the tile then sent to and received by a next tile in the sequence within the static interconnect system. The next tile then processes the received input. This process is repeated by each tile in the sequence.
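
A minimal software sketch of this sequencing follows, assuming each tile can be abstracted as a function that transforms its input before forwarding it; the `run_sequence` helper and the stand-in tile operations are hypothetical, and the un-core is modeled only as the source and sink of the data.

```python
# Data leaves the un-core, visits each tile in ring order, and returns.
from typing import Callable, Iterable

def run_sequence(uncore_input: float,
                 tiles: Iterable[Callable[[float], float]]) -> float:
    data = uncore_input              # the sequence starts at the un-core
    for tile in tiles:               # each tile processes, then forwards
        data = tile(data)
    return data                      # the sequence ends back at the un-core

stand_in_tiles = [lambda v, k=k: v * 2 + k for k in range(4)]
print(run_sequence(1.0, stand_in_tiles))   # 27.0
```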

FIG. 1A is a block diagram of an example neural network accelerator tile 100. The example tile 100 includes a processing element 110, ring-bus 120, transmission coils 130, TCI receivers 142, and TCI transmitters 140. Neural network accelerator tile 100 may be fabricated on a wafer within an array of like neural network accelerator tiles. The array of neural network accelerator tiles may be included in a fabricated die on the wafer. The tile processing element (or processor core(s)) 110 may include functional units, memory, a data path, and control logic, which are used to perform calculation and control functions. In some implementations, the transmission coils 130 are fabricated above processing element 110 (i.e., the logic and/or memory areas) of the neural network accelerator tile 100 to maximize area savings.

Ring-bus 120 represents the interconnection of the tile 100 components, such as the processing element 110, transmission coils 130, TCI receivers 142, and TCI transmitters 140, as well as the interconnection between other neural network accelerator tiles fabricated within the same die (i.e., within the same tile array). In some implementations, the ring-bus 120 is a portion of a planar embedded ring-bus on the respective neural network chip that connects the tiles within an array to form a Hamiltonian circuit in a directed, bipartite graph, where each processing tile is represented by one input and one output vertex and where the processing unit is the edge that connects the input to the output. For ring-bus 120, possible multiplexer configurations may be represented by the multitude of edges that connect certain outputs to certain inputs. In some implementations, to facilitate a linear series of tiles as part of the planar embedding, ring-bus 120 enters the tile 100 on one side and leaves it on the opposite side.
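
The graph formulation can be illustrated with a short sketch that checks whether a candidate multiplexer configuration traces a single Hamiltonian circuit over all tiles. This is a simplification in which each tile's input vertex, processing-element edge, and output vertex are collapsed into one node; the `is_single_ring` helper and the example configurations are hypothetical.

```python
# mux maps each tile's output to the input of the next tile; the chosen
# edges form one ring iff they trace a single Hamiltonian circuit.
def is_single_ring(mux):
    n, seen, tile = len(mux), set(), 0
    for _ in range(n):
        if tile in seen:
            return False             # revisited a tile before the ring closed
        seen.add(tile)
        tile = mux[tile]             # follow out_tile -> in_next_tile
    return tile == 0 and len(seen) == n

print(is_single_ring({0: 1, 1: 2, 2: 3, 3: 0}))   # True: one ring over 4 tiles
print(is_single_ring({0: 1, 1: 0, 2: 3, 3: 2}))   # False: two disjoint rings
```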

As described above, tile 100 is an individual computing unit that may be included within an array of like tiles on a neural network accelerator chip. In some implementations, tile 100 may be communicatively coupled to one or more adjacent tiles, which may be stacked to form a static interconnect system within a three dimensionally stacked neural network accelerator. The stacked tiles are employed to distribute the computation of a neural network across the three dimensionally stacked neural network accelerator. For example, each tile 100 communicates with one or more adjacent tiles (i.e., tiles that are above or below and connected wirelessly, or tiles within the same tile array on a neural network chip connected through the planar ring-bus) to form the static interconnect system. The interconnect system can be configured so that the processing tiles are part of one or more ring-busses, such as ring-bus 120, that encompass the computational resources of the three-dimensional stack. Such a configuration allows the tiles in a three-dimensional stack of chips to be utilized efficiently and provides flexibility to reorganize the computational resources into multiple rings if demanded by the application.

Transmission coils 130 are embedded in the tile 100 and provide for TCI connections, which are received by the respective TCI receiver 142 and transmitted by the respective TCI transmitter 140. The transmission coils 130 employ inductive coupling using magnetic fields to enable Near Field Wireless Communication between the transmission coils 130 of other tiles 100 that are, for example, stacked three-dimensionally above or below the respective tile. The enabled Near Field Wireless Communication provides for digital logic interconnections between the three-dimensionally stacked neural network accelerator chips. In some implementations, a tile 100 may employ the established Near Field Wireless Communication to communicate with an adjacent tile that is above or below tile 100 in the three dimensional stack. The transmission coils 130 may be offset from one another, as shown in FIG. 1A, such that when two tiles are stacked, the respective transmission coils do not interfere with the transmissions between other coils. Together, the transmission coils 130, the TCI receivers 142, and the TCI transmitters 140 form a TCI. Such a TCI is small relative to tile 100, such that the area needed for the TCI connection is smaller than that of a comparable TSV. For example, in a contemporary process node with feature sizes below 20 nanometers (nm), a bandwidth in excess of 50 gigabits per second (Gb/s) is realizable. The actual speed is subject to engineering considerations, such as power and the complexity of the serializer/deserializer (SERDES) logic. For example, TCI coil size depends on the thickness of the stacked dies. Current thinning technology has demonstrated 2.6 micrometer (μm) die thickness for a coil size of 3 times 2.6 μm, or about 8 μm, on a side. A more conservative die thickness would be 4 μm, with a coil size of approximately 12 μm.
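
As a back-of-the-envelope check of the figures above, and not part of the disclosed design, the coil edge length can be computed as roughly three times the die thickness:

```python
# Coil edge length ~= 3x the chip-to-chip distance (die thickness).
def coil_edge_um(die_thickness_um):
    return 3.0 * die_thickness_um

for thickness_um in (2.6, 4.0):      # demonstrated and conservative thinning
    print(f"{thickness_um} um die -> ~{coil_edge_um(thickness_um):.1f} um coil edge")
# 2.6 um die -> ~7.8 um coil edge   (the "about 8 um on a side" figure)
# 4.0 um die -> ~12.0 um coil edge  (the "approximately 12 um" figure)
```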

For example, a tile may be of the order of 1 by 1 mm and have room for approximately 6000 TCIs. A tile 100 with a high bandwidth design may include a number of TCIs (the transmission coils 130, the TCI receivers 142, and the TCI transmitters 140) that cover a significant fraction of this tile area. For example, a group of TCIs may be operated at 20 Gb/s and require approximately 50 TCIs to send data from the ring-bus 120 and another 50 TCIs to receive data for the ring-bus 120.

A tile with a medium bandwidth design includes a number of TCIs that cover a smaller portion of the tile area. For example, the die thickness may be increased to approximately 15 μm and the tile 100 may include approximately 20-30 TCIs. In such an example, transmission coils 130 may have a 45 μm side-length and yield approximately 400 possible TCI sites. The TCIs may be placed in a linear row on half of an edge of a 1 by 1 mm tile, where both the TCI transmitter 140 and the TCI receiver 142 are near the interface side of the tile and run at less than 10 Gb/s. An example medium bandwidth design configuration is depicted in FIG. 5.

In some implementations, tile 100 includes a portion 120 of a planar ring-bus. The planar ring-bus communicably couples each tile in the array on a neural network chip. The ring-bus has approximately 2000 wires that run from one tile to the next (i.e., point-to-point) and carry a bandwidth of between 0.25 and 0.5 gigabits per second (Gb/s) each. The ring-bus width is the number of wires that make up the ring-bus. For example, each tile on a chip sends out data on the approximately 2000 wires and has another set of approximately 2000 wires incoming from the previous tile.
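
These figures can be cross-checked against the approximately 50 TCIs per direction cited in the high bandwidth example above. The short calculation below assumes the upper per-wire rate of 0.5 Gb/s and a 20 Gb/s per-TCI signaling rate:

```python
# Ring-bus bandwidth vs. the number of TCIs needed to carry it one way.
ring_wires = 2000          # point-to-point wires per ring-bus hop
wire_rate_gbps = 0.5       # per-wire rate (0.25-0.5 Gb/s per the text)
tci_rate_gbps = 20.0       # per-TCI signaling rate (high bandwidth example)

ring_bandwidth_gbps = ring_wires * wire_rate_gbps          # 1000.0 Gb/s
tcis_per_direction = ring_bandwidth_gbps / tci_rate_gbps   # 50.0 TCIs
print(ring_bandwidth_gbps, tcis_per_direction)
```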

In such implementations, the signaling rate of a TCI for tile 100 may be between 20 and 40 Gb/s. In some implementations, TCIs may run at a high rate to conserve power because the transmitter draws a constant amount of power, independent of the actual data rate, due to constant current switching. The coil size is a function of the individual die thickness. Tile 100 may be thinned down to between 2.6 and 10 micrometers. This corresponds to a TCI coil edge length of 12 to 30 micrometers, or three times the chip-to-chip distance.

For a high bandwidth design, a tile thickness at the upper range of thickness (10 μm), a fast signaling rate, and a low multiplexing ratio may be used. For some implementations of a high bandwidth design, the TCIs on tile 100 can either transmit or receive data at the ring-bus rate, but not both. In such implementations, a bandwidth assumption may use a larger number of TCIs (of the available approximately 6000 TCIs per tile) so that there is enough room on one tile for enough TCIs to transmit or receive the bandwidth equivalent to one ring-bus connection. Example high bandwidth design configurations are depicted in FIGS. 2 and 3.

FIG. 1B is a block diagram of an abstract representation of a tile 100. The abstract representation of tile 100 in FIG. 1B includes processing element 110 and a set of TCIs represented by a circle 150. The set of TCIs 150 for tile 100 includes the transmission coils 130, the TCI receivers 142, and the TCI transmitters 140 from FIG. 1A.

FIG. 1C is a block diagram of another abstract representation of a tile 100. The abstract representation of tile 100 in FIG. 1C includes processing element 110, two sets of TCIs represented by circles 150, and multiplexer 160. The set of TCIs 150 for tile 100 includes the transmission coils 130, the TCI receivers 142, and the TCI transmitters 140 from FIG. 1A, grouped as two separate sets. Multiplexer 160 governs which TCI set is transmitting and which is receiving and is controlled statically by, for example, a configuration register. As alluded to above, the number of possible TCI sites for one tile can be quite large (approximately 6000); thus each of the two circles represents a group of TCIs configured to be either transmitters or receivers (consistent with the symbology of FIG. 1B). The abstract representations in FIGS. 1B and 1C are used in FIGS. 2-5.
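
The statically configured multiplexer can be modeled in software as follows; the `TciMux` class and its `swap` bit are hypothetical stand-ins for the configuration-register field that selects which TCI set transmits and which receives.

```python
# One statically set register bit decides which TCI set transmits (tx)
# and which receives (rx), mirroring multiplexer 160.
from dataclasses import dataclass

@dataclass
class TciMux:
    swap: bool = False     # hypothetical configuration-register bit

    def roles(self):
        return {"tci_set_a": "rx" if self.swap else "tx",
                "tci_set_b": "tx" if self.swap else "rx"}

print(TciMux(swap=False).roles())   # {'tci_set_a': 'tx', 'tci_set_b': 'rx'}
print(TciMux(swap=True).roles())    # {'tci_set_a': 'rx', 'tci_set_b': 'tx'}
```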

FIG. 2 illustrates an example three-dimensionally stacked neural network accelerator 200 with two neural network accelerator chips 220 and 222. A stack of two chips is depicted; however, any number of chips (layers) may be used. Neural network accelerator chips 220 and 222 include neural network accelerator tiles 100, which include one TCI set (as shown in FIG. 1B). In the depicted example, neural network accelerator chips 220 and 222 are placed on top of each other in the same orientation, such that the ring-buses (240, 242) for the respective neural network accelerator chips 220 and 222 run parallel and in the same direction. TCI data connections 232 provide communication between accelerator chips 220 and 222 through adjacent tiles 100 using inductive coupling as described above. Crossover point 230 is where the TCI data connections 232 are used to route the ring-buses 240 and 242 between the network accelerator chips 220 and 222. Crossover point 230 is created by stitching the ring-buses 240 and 242 into one ring that encompasses all tiles 100 of both network accelerator chips 220 and 222. The one ring communicatively couples the tiles 100 of both neural network accelerator chips 220 and 222. In the depicted example, a single pair of TCI data connections 232 is shown; however, any number of pairs of TCI data connections 232 may be formed between the neural network accelerator chips 220 and 222. Each pair of tiles that may participate in a vertical data exchange has two sets of wires connecting these tiles (crossover points 230), which may require double the amount of wires (i.e., 4000 instead of 2000).
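
As an illustration of this stitching, the sketch below splices two planar rings into one ring at a chosen crossover index. The `stitch_rings` helper and the tile labels are hypothetical; in hardware the splice is effected by the TCI data connections and multiplexers rather than by list manipulation.

```python
# Splice ring_b into ring_a immediately after position `cross`, modeling
# the vertical TCI hop at the crossover point.
def stitch_rings(ring_a, ring_b, cross):
    return ring_a[: cross + 1] + ring_b + ring_a[cross + 1:]

ring_a = [("A", i) for i in range(4)]   # chip A's planar ring of tiles
ring_b = [("B", i) for i in range(4)]   # chip B's planar ring of tiles
print(stitch_rings(ring_a, ring_b, cross=1))
# [('A', 0), ('A', 1), ('B', 0), ('B', 1), ('B', 2), ('B', 3), ('A', 2), ('A', 3)]
```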

FIG. 3 illustrates an example three-dimensionally stacked neural network accelerator 300 with two neural network accelerator chips 320 and 322. A stack of two chips is depicted; however, any number of chips (layers) may be used. Neural network accelerator chips 320 and 322 include neural network accelerator tiles 100, which include one TCI set (as shown in FIG. 1B). In the depicted example, neural network accelerator chips 320 and 322 are placed on top of each other but with the orientation rotated 180 degrees with respect to each other. Similar to FIG. 2, the TCI data connections 332 provide communication between accelerator chips 320 and 322 through adjacent tiles 100 using inductive coupling.

In the depicted example, with some minor constraints (e.g., avoiding rotationally symmetric layouts) on the embedding of the planar ring-buses 340 and 342, the rotated neural network accelerator chips 320 and 322 cause the respective ring-buses 340 and 342 to run in opposite directions at the crossover site 330. Constraints on the locations of the TCI sites in the disclosed design allow for the vertical alignment of TCIs even when two chips are rotated 180 degrees when stacked. Additionally, the layout depicted in FIG. 3 alleviates one chip from having two sets of ring-bus wires, as depicted in FIG. 2, at the crossover site 330 to carry the data traffic while the other chip does not use any wires. This configuration may reduce wiring cost, which can exceed the cost of the multiplexers that implement a ring-bus crossover. Additionally, the layout in FIG. 3 may reduce routing overhead. In the depicted example, a single pair of TCI data connections 332 is shown; however, any number of pairs of TCI data connections 332 may be formed between the neural network accelerator chips 320 and 322. Such a design allows for the formation of multiple, independent rings, which might be needed in some applications.

FIG. 4 illustrates an example three-dimensionally stacked neural network accelerator 400 with two neural network accelerator chips 420 and 422. A stack of two chips is depicted; however, any number of chips (layers) may be used. Neural network accelerator chips 420 and 422 include neural network accelerator tiles 100, which include two TCI sets (as shown in FIG. 1C). In the depicted example, neural network accelerator chips 420 and 422 are placed on top of each other and stacked in the same orientation. TCI data connections 432 are established between the TCI sets in a pair of adjacent tiles 100 and provide communication between accelerator chips 420 and 422 through the two adjacent tiles 100 using inductive coupling as described above. By employing two TCI sets in the tile 100, the crossover is localized to just one tile site. This configuration may alleviate the need for long wires to span the entire tile. Instead, the depicted accelerator 400 may employ a symmetry breaking bit in the tile configuration that controls the multiplexer and governs which TCI set is transmitting and which is receiving. In the depicted example, a single pair of tiles forming TCI data connections 432 is shown; however, any number of pairs of TCI data connections 432 may be formed between the neural network accelerator chips 420 and 422.

FIG. 5 illustrates an example three-dimensionally stacked neural network accelerator 500 with a vertical ring-bus implementation for a medium bandwidth design. The depicted example shows three stacked neural network accelerator chips 510, 520, and 530 with TCI connections 542 and 544 between the chips. The TCI connection 542 is between tile site 512 on chip 510 and tile site 524 on chip 520. The TCI connection 544 is between tile site 522 on chip 520 and tile site 532 on chip 530. In the depicted example, each tile site 512, 522, 524, and 532 forms one vertical ring-bus that interconnects all the tiles on all stacked chips that share the same tile positions (i.e., each column of tiles is connected as one ring). Each stacked chip 510, 520, and 530 is rotated by 90 degrees with respect to the preceding chip in the stack. The ring-bus connections form a bifilar spiral through the stack. The top (or bottom) chip reflects the ring-bus to close the ring. In some implementations, two processing tiles are combined into one virtual tile of this column so that at least one processing tile is traversed on the way up and another on the way down. To control the number of tiles in the ring independently of the number of chips in the stack, the virtual tiles that make up one vertical spiral may group a larger (even) number of tile processors. In the depicted example, the bottom chip 510 may include an interface to a host computer and/or a ring-bus controller, while the chips that make up the rest of the stack are pure tile arrays. Such an arrangement provides additional TCI based vertical busses that can be used to broadcast control signals to all tiles simultaneously, avoiding the delay associated with running a wire all the way across a chip. In some implementations, rings may be stitched together on the controller chip 510 to create longer rings with more tiles. Such a configuration provides for dynamically changing the controller-to-tile ratio. In the depicted example, shorting planes 518, 528, and 538 are employed to prevent interference from TCI coils reaching beyond the next chip. In some implementations, shorting planes 518, 528, and 538 are a solid metal plane or a dense grid, which can serve to shorten the range of the TCI without imposing a significant cost increase in the overall fabrication process.
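
The up-one-tile, down-another traversal of a tile column can be illustrated with a short sketch that enumerates the stops of one vertical ring; the layer count and tile names are hypothetical, and the 90 degree rotation and TCI details are abstracted away.

```python
# Enumerate one vertical ring: up through one member of each virtual tile,
# then down through the other; the top of the stack closes the ring.
def vertical_ring(layers):
    up = [(layer, "tile_up") for layer in range(layers)]
    down = [(layer, "tile_down") for layer in reversed(range(layers))]
    return up + down

for stop in vertical_ring(3):           # a hypothetical three-chip stack
    print(stop)
# (0, 'tile_up') (1, 'tile_up') (2, 'tile_up')
# (2, 'tile_down') (1, 'tile_down') (0, 'tile_down')
```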

While this specification contains many specific implementation details, these should not be construed as limitations on the scope of any invention or of what may be claimed, but rather as descriptions of features that may be specific to particular embodiments of particular inventions. Certain features that are described in this specification in the context of separate embodiments can also be implemented in combination in a single embodiment. Conversely, various features that are described in the context of a single embodiment can also be implemented in multiple embodiments separately or in any suitable subcombination. Moreover, although features may be described above as acting in certain combinations and even initially claimed as such, one or more features from a claimed combination can in some cases be excised from the combination, and the claimed combination may be directed to a subcombination or variation of a subcombination.

Similarly, while operations are depicted in the drawings in a particular order, this should not be understood as requiring that such operations be performed in the particular order shown or in sequential order, or that all illustrated operations be performed, to achieve desirable results. In certain circumstances, multitasking and parallel processing may be advantageous. Moreover, the separation of various system modules and components in the embodiments described above should not be understood as requiring such separation in all embodiments, and it should be understood that the described program components and systems can generally be integrated together in a single software product or packaged into multiple software products.

Particular embodiments of the subject matter have been described. Other embodiments are within the scope of the following claims. For example, the actions recited in the claims can be performed in a different order and still achieve desirable results. As one example, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In certain implementations, multitasking and parallel processing may be advantageous.

What is claimed is: 1.-20. (canceled)
21. A three dimensional neural network accelerator comprising: a first neural network accelerator chip comprising a first array of tiles that includes a first neural network accelerator tile comprising a first transmission coil; a second neural network accelerator chip, adjacent to and aligned vertically with the first neural network accelerator tile, the second neural network accelerator chip comprising a second array of tiles that includes a second neural network accelerator tile comprising a second transmission coil configured to establish wireless communication with the first transmission coil via inductive coupling, wherein each tile of the first array of tiles and the second array of tiles is a self-contained component that can independently execute computations for the three dimensional neural network accelerator; and an un-core controller comprising one or more functional blocks, wherein the first array of tiles and the second array of tiles are configured to accelerate a computation of a neural network by forming, through wireless communication established between the first transmission coil and the second transmission coil, a static interconnect system organized as a linear sequence of processing that starts and ends in the un-core controller.
 22. The three dimensional neural network accelerator of claim 21, wherein the linear sequence of processing is through the first array of tiles and the second array of tiles.
 23. The three dimensional neural network accelerator of claim 21, wherein the one or more functional blocks are configured to handle one or more of: (i) input/output (I/O) to a host computer, (ii) interface to off-chip memory, (iii) connecting to I/O devices, or (iv) performing synchronization, coordination, and buffer functions.
 24. The three dimensional neural network accelerator of claim 21, wherein the first transmission coil is further configured to provide a digital logic interconnection between the first neural network accelerator tile and the second neural network accelerator tile through Near Field Wireless Communication.
 25. The three dimensional neural network accelerator of claim 21, wherein each tile of the first array of tiles and the second array of tiles comprises a processing element and a memory.
 26. The three dimensional neural network accelerator of claim 25, wherein the first transmission coil further comprises a ThruChip Interface (TCI) receiver and a TCI transmitter, wherein the TCI receiver is configured to receive wireless communication from the second transmission coil, and wherein the TCI transmitter is configured to transmit wireless communication to the second transmission coil.
 27. The three dimensional neural network accelerator of claim 26, wherein the first neural network accelerator tile further comprises a ring-bus, wherein the processing element of the first neural network accelerator tile, the first transmission coil, the TCI receiver, and the TCI transmitter are communicably connected through the ring-bus.
 28. The three dimensional neural network accelerator of claim 25, wherein each of the processing elements includes circuitry to perform neural network computations in hardware.
 29. The three dimensional neural network accelerator of claim 21, wherein the first transmission coil is further configured to establish a ThruChip Interface (TCI) connection with the second transmission coil to form a vertical ring-bus.
 30. The three dimensional neural network accelerator of claim 21, wherein the first neural network accelerator tile further comprises a shorting plane to prevent interference from other transmission coils.
 31. The three dimensional neural network accelerator of claim 21, wherein the first neural network accelerator chip and second neural network accelerator chip are oriented the same.
32. A method for fabricating a neural network accelerator, the method comprising: stacking a first neural network accelerator chip and a second neural network accelerator chip, wherein the first neural network accelerator chip comprises a first array of tiles that includes a first neural network accelerator tile comprising a first transmission coil, wherein the second neural network accelerator chip comprises a second array of tiles that includes a second neural network accelerator tile comprising a second transmission coil configured to wirelessly communicate with the first transmission coil via inductive coupling, wherein each tile of the first array of tiles and the second array of tiles is a self-contained component that can independently execute computations for the neural network accelerator, wherein the first neural network accelerator tile is adjacent to and aligned vertically with the second neural network accelerator tile, wherein the first array of tiles and the second array of tiles are configured to accelerate a computation of a neural network by forming, through wireless communication established between the first transmission coil and the second transmission coil, a static interconnect system organized as a linear sequence of processing that starts and ends in an un-core controller, and wherein the un-core controller comprises one or more functional blocks.
 33. The method of claim 32, wherein the linear sequence of processing is through the first array of tiles and the second array of tiles.
 34. The method of claim 32, wherein the one or more functional blocks are configured to handle one or more of: (i) input/output (I/O) to a host computer, (ii) interface to off-chip memory, (iii) connecting to I/O devices, or (iv) performing synchronization, coordination, and buffer functions.
 35. The method of claim 32, wherein each tile of the first array of tiles and the second array of tiles comprises a processing element and a memory.
 36. The method of claim 32, wherein the first transmission coil is further configured to provide a digital logic interconnection between the first neural network accelerator tile and the second neural network accelerator tile through Near Field Wireless Communication.
 37. The method of claim 32, wherein the first transmission coil further comprises a ThruChip Interface (TCI) receiver and a TCI transmitter, wherein the TCI receiver is configured to receive wireless communication from the second transmission coil, and wherein the TCI transmitter is configured to transmit wireless communication to the second transmission coil.
 38. The method of claim 32, wherein the first transmission coil is further configured to establish a ThruChip Interface (TCI) connection with the second transmission coil to form a vertical ring-bus.
 39. The method of claim 38, wherein the second neural network accelerator chip is rotated 90 degrees with respect to the first neural network accelerator chip, and wherein the vertical ring-bus forms a bifilar spiral through the stack of the first neural network accelerator tile and the second neural network accelerator tile.
 40. The method of claim 39, wherein the first neural network accelerator tile further comprises a shorting plane to prevent interference from other transmission coils.